String Transformation Learning
Abstract
String transformation systems have been introduced in (Brill, 1995) and have several applications in natural language processing. In this work we consider the computational problem of automatically learning from a given corpus the set of transformations presenting the best evidence. We introduce an original data structure and efficient algorithms that learn some families of transformations that are relevant for part-of-speech tagging and phonological rule systems. We also show that the same learning problem becomes NP-hard in cases of an unbounded use of don't care symbols in a transformation.
1 Introduction
Ordered sequences of rewriting rules are used in several applications in natural language processing, including phonological and morphological systems (Kaplan and Kay, 1994), morphological disambiguation, part-of-speech tagging and shallow syntactic parsing (Brill, 1995), (Karlsson et al., 1995). In (Brill, 1995) a learning paradigm, called error-driven learning, has been introduced for automatic induction of a specific kind of rewriting rules called transformations, and it has been shown that the achieved accuracy of the resulting transformation systems is competitive with that of existing systems.
In this work we further elaborate on the error-driven learning paradigm. Our main contribution is summarized in what follows. We consider some families of transformations and design efficient algorithms for the associated learning problem that improve existing methods. Our results are achieved by exploiting a data structure originally introduced in this work. This allows us to simultaneously represent and test the search space of all possible transformations. The transformations we investigate make use of classes of symbols, in order to generalize regularities in rule applications. We also show that when an unbounded number of these symbol classes are allowed within a transformation, then the associated learning problem becomes NP-hard.
The notation we use in the remainder of the paper is briefly introduced here. Σ denotes a fixed, finite alphabet and ε the null string. Σ* and Σ+ are the set of all strings and all non-null strings over Σ, respectively. Let w ∈ Σ*. We denote by |w| the length of w. Let w = uxv; u is a prefix and v is a suffix of w; when x is non-null, it is called a factor of w. The suffix of w of length i is denoted suff_i(w), for 0 ≤ i ≤ |w|. Assume that x is non-null, and w = u_i x suff_i(w) for ℓ > 0 different values of i but not for ℓ + 1, or x is not a factor of w and ℓ = 0. Then we say that ℓ is the statistic of factor x in w.
2 The learning paradigm
The learning paradigm we adopt is called error-driven learning and has been originally proposed in (Brill, 1995) for part-of-speech tagging applications. We briefly introduce here the basic assumptions of the approach. A string transformation is a rewriting rule denoted as u → v, where u and v are strings such that |u| = |v|. This means that if u appears as a factor of some string w, then u should be replaced by v in w. The application of the transformation might be conditioned by the requirement that some additionally specified pattern matches some part of the string w to be rewritten.
We now describe how transformations can be automatically learned. A pair of strings (w, w') is an aligned pair if |w| = |w'|. When w = u x suff_i(w), w' = u' x' suff_i(w') and |x| = |x'|, we say that factors x and x' occur at aligned positions within (w, w'). A multi-set of aligned pairs is called an aligned corpus. Let (w, w') be an aligned pair and let τ be some transformation of the form u → v. The positive evidence of τ (w.r.t. (w, w')) is the number of different positions at which factors u and v are aligned within (w, w'). The negative evidence of τ (w.r.t. (w, w')) is the number of different positions at which factors u and u are aligned within
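The statistic of a factor and the positive evidence of a transformation, as defined above, can be illustrated with a naive sketch in Python. This is not the data structure or the algorithms of the paper; it is a direct position-by-position count. The function names and the toy strings are assumptions made for illustration only. Negative evidence is left out because its definition is cut off in the text above.

```python
def factor_statistic(x: str, w: str) -> int:
    """Count the distinct positions at which the non-null string x
    occurs as a factor of w (overlapping occurrences are counted)."""
    if not x:
        raise ValueError("x must be non-null")
    # An occurrence starting at i corresponds to w = u x suff_j(w)
    # with j = len(w) - i - len(x).
    return sum(1 for i in range(len(w) - len(x) + 1) if w.startswith(x, i))


def positive_evidence(u: str, v: str, w: str, w_prime: str) -> int:
    """Count the positions at which u (in w) and v (in w_prime) are
    aligned, for a transformation u -> v and an aligned pair (w, w_prime)."""
    if not u or len(u) != len(v):
        raise ValueError("u and v must be non-null and of equal length")
    if len(w) != len(w_prime):
        raise ValueError("(w, w_prime) must be an aligned pair")
    return sum(
        1
        for i in range(len(w) - len(u) + 1)
        if w.startswith(u, i) and w_prime.startswith(v, i)
    )


if __name__ == "__main__":
    # Hypothetical toy strings, not taken from any corpus.
    print(factor_statistic("ab", "abcabab"))
    print(positive_evidence("ab", "xy", "abcab", "xycab"))
```

Both functions scan every starting position, so they take time proportional to |w| · |x| per query. That is adequate for checking the definitions on small examples, but it tests one candidate transformation at a time rather than representing the whole search space simultaneously.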
Publication date: 1997